I’m excited to take this course and hope to learn more about statistics and programming! My GitHub page: https://github.com/ntrmarie/IODS-project
Let’s look at the variables included in the dataset:
learning2014 <- read.csv("/Users/Maria/R/IODS-project/data/learning2014.csv")
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166 7
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# creates the plot matrix for the dataset
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
Description of the plots:
The first row of box plots presents the gender-wise distributions of “age”, “attitude”, “deep”, “stra”, “surf” and “points”. Each box plot shows the lower and upper quartiles and the median. Red denotes female and blue male. The first cell also shows the gender distribution: there are about twice as many females (F) as males (M).
The upper triangle presents the correlation coefficients, overall and by gender. The lower triangle presents scatter plots for every pair of variables, showing their pairwise relationships. The diagonal plots (from top left to bottom right) present the distributions of the continuous variables.
The highest correlation coefficients are for the following variables:
In addition: the age distribution is skewed towards younger students (mostly around 20 years old).
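The correlation coefficients shown by `ggpairs` are Pearson correlations, which `cor()` computes from standardized cross-products. A minimal base-R sketch with hypothetical vectors (not the course data):

```r
# hypothetical vectors standing in for two of the variables
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

# Pearson correlation computed directly from its definition
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# the built-in cor() gives the same value
r_builtin <- cor(x, y)
```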
# creates regression model with three variables of highest (absolute) correlation with the target variable
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
# shows the summary
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Summary of the model:
For “stra” and “surf” the p-values are high (0.117 and 0.466), so we cannot reject the hypothesis that their coefficients equal zero: these explanatory variables do not have a statistically significant relationship with the target variable. Let’s fit another model without the “surf” variable, which has the highest p-value:
# creates regression model with two variables
my_model2 <- lm(points ~ attitude + stra, data = learning2014)
# shows the summary
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
The new interpretation:
Multiple R-squared of the model: since \(R^2 \approx 0.20\), the model explains about 20% of the variance in exam points.
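The reported multiple R-squared can be reproduced by hand as \(R^2 = 1 - SS_{res}/SS_{tot}\). A small sketch with hypothetical data standing in for attitude and points:

```r
# hypothetical data (not the learning2014 dataset)
att <- c(2.1, 2.9, 3.5, 3.7, 4.0, 4.4)
pts <- c(12, 18, 21, 25, 26, 30)

fit <- lm(pts ~ att)

ss_res <- sum(residuals(fit)^2)      # residual sum of squares
ss_tot <- sum((pts - mean(pts))^2)   # total sum of squares
r2_manual <- 1 - ss_res / ss_tot

# matches the value reported by summary()
r2_builtin <- summary(fit)$r.squared
```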
Let’s produce the following diagnostic plots: Residuals vs Fitted values, Normal Q-Q plot and Residuals vs Leverage.
# sets the plots' location
par(mfrow = c(2,2))
# creates diagnostic plots
plot(my_model, which=c(1,2,5))
The Residuals vs Fitted scatter plot (top left) shows that the residuals are spread evenly around zero. Thus the constant-variance assumption holds: the variance of the errors is the same across all fitted values.
The Q-Q plot (top right) provides a way to check the normality-of-errors assumption underlying linear regression. In our case it shows a very reasonable fit.
The Residuals vs Leverage plot (bottom left) shows the impact single observations have on the model. Three observations stand out: 35, 77 and 145, but they do not substantially influence the regression line. Therefore the assumptions of our linear model appear to hold.
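The leverage on the x-axis of the Residuals vs Leverage plot is the diagonal of the hat matrix \(H = X(X^TX)^{-1}X^T\). A base-R sketch with hypothetical data containing one high-leverage point:

```r
# hypothetical data with one point far from the others in x
x <- c(1, 2, 3, 4, 10)
y <- c(2.0, 2.9, 4.1, 5.2, 11.0)
fit <- lm(y ~ x)

# hat matrix computed from the design matrix
X <- model.matrix(fit)
H <- X %*% solve(t(X) %*% X) %*% t(X)
lev_manual <- diag(H)

# same values R reports via hatvalues()
lev_builtin <- unname(hatvalues(fit))
```

The fifth observation (x = 10) has the largest leverage, since it lies farthest from the mean of x.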
The data describe student achievement in secondary education at two Portuguese schools. We are interested in two datasets presenting performance in Mathematics (math) and Portuguese language (por).
Full information about the data is available here: https://archive.ics.uci.edu/ml/datasets/Student+Performance.
Having combined the two datasets, we will analyze the following variables:
# read the data
alc <- read.csv("/Users/Maria/R/IODS-project/data/alc.csv")
dim(alc)
## [1] 370 35
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
str(alc)
## 'data.frame': 370 obs. of 35 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
The purpose of this part is to study the relationships between high/low alcohol consumption and some of the other variables in the data. Let’s choose four variables in the data and present four hypotheses about their relationships with alcohol consumption:
Let’s create plots to explore our hypotheses:
# load the packages used below
library(dplyr)
library(tidyr)
library(ggplot2)
# compute the mean of the three period grades
alc$G <- (alc$G1 + alc$G2 + alc$G3)/3
# check the distributions with bar plots
bar <- select(alc, "G", "freetime", "absences", "famrel")
gather(bar) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
The plots show the following:
# check the distributions with box plots
library(cowplot)  # provides plot_grid() used below
# for high_use variable TRUE is for students for which 'alc_use' is greater than 2, otherwise FALSE
# G is a mean grade
d1 <- ggplot(alc, aes(x = high_use, y = G)) + geom_boxplot() + ylab("grade")
d2 <- ggplot(alc, aes(x = high_use, y = freetime)) + geom_boxplot()
d3 <- ggplot(alc, aes(x = high_use, y = absences)) + geom_boxplot()
d4 <- ggplot(alc, aes(x = high_use, y = famrel)) + geom_boxplot() + ylab("family relations")
d <- plot_grid(d1, d2, d3, d4, labels = c('a', 'b', 'c', 'd'), align = "hv")
d
The box plots show the following:
Let’s create a logistic regression model with the binary high/low alcohol consumption variable as the target and these explanatory variables: “freetime”, “G” (grades), “absences”, “famrel”
model <- glm(high_use ~ freetime + G + absences + famrel, data = alc, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ freetime + G + absences + famrel, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8494 -0.8183 -0.6309 1.1567 2.0665
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.29109 0.80769 -0.360 0.718552
## freetime 0.44845 0.12807 3.502 0.000463 ***
## G -0.09783 0.04359 -2.244 0.024810 *
## absences 0.08204 0.02249 3.648 0.000264 ***
## famrel -0.34003 0.13016 -2.612 0.008989 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 411.59 on 365 degrees of freedom
## AIC: 421.59
##
## Number of Fisher Scoring iterations: 4
coef(model)
## (Intercept) freetime G absences famrel
## -0.29108828 0.44845029 -0.09783016 0.08203815 -0.34002935
odds <- coef(model) %>% exp
confi <- confint(model) %>% exp
## Waiting for profiling to be done...
cbind(odds, confi)
## odds 2.5 % 97.5 %
## (Intercept) 0.7474497 0.1514430 3.6313994
## freetime 1.5658836 1.2235216 2.0238304
## G 0.9068029 0.8315906 0.9870153
## absences 1.0854972 1.0407617 1.1372571
## famrel 0.7117494 0.5501574 0.9182071
The model shows the following:
From the p-values we can conclude that all our explanatory variables are statistically significant at the 5% level.
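The exponentiated coefficients above are odds ratios. For a single binary predictor, logistic regression reproduces the cross-product odds ratio of the 2×2 table exactly; a sketch with hypothetical counts (not the alc data):

```r
# hypothetical 2x2 counts: binary exposure vs binary outcome
d <- data.frame(
  exposed = c(1, 1, 0, 0),
  outcome = c(1, 0, 1, 0),
  n       = c(30, 70, 10, 90)
)

# grouped logistic regression using the counts as weights
fit <- glm(outcome ~ exposed, family = binomial, weights = n, data = d)
or_glm <- exp(coef(fit)["exposed"])

# cross-product odds ratio straight from the table: (30*90)/(70*10)
or_table <- (30 * 90) / (70 * 10)
```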
# all variables are significant, so we keep the same model
newmodel <- glm(high_use ~ freetime + G + absences + famrel, data = alc, family = "binomial")
summary(newmodel)
##
## Call:
## glm(formula = high_use ~ freetime + G + absences + famrel, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8494 -0.8183 -0.6309 1.1567 2.0665
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.29109 0.80769 -0.360 0.718552
## freetime 0.44845 0.12807 3.502 0.000463 ***
## G -0.09783 0.04359 -2.244 0.024810 *
## absences 0.08204 0.02249 3.648 0.000264 ***
## famrel -0.34003 0.13016 -2.612 0.008989 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 411.59 on 365 degrees of freedom
## AIC: 421.59
##
## Number of Fisher Scoring iterations: 4
Cross tabulation of predictions versus the actual values:
probabilities <- predict(newmodel, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
select(alc, freetime, G, absences, famrel, high_use, probability, prediction) %>% tail(10)
## freetime G absences famrel high_use probability prediction
## 361 4 12.666667 3 4 FALSE 0.2993338 FALSE
## 362 3 4.000000 0 4 FALSE 0.3324388 FALSE
## 363 4 4.666667 7 5 TRUE 0.4800836 FALSE
## 364 3 12.333333 1 5 FALSE 0.1454904 FALSE
## 365 4 7.000000 6 4 FALSE 0.4875059 FALSE
## 366 4 7.000000 2 5 FALSE 0.3277964 FALSE
## 367 3 11.666667 2 4 FALSE 0.2170178 FALSE
## 368 1 6.666667 3 1 FALSE 0.3569208 FALSE
## 369 4 12.666667 4 2 TRUE 0.4779205 FALSE
## 370 4 10.666667 2 4 TRUE 0.3236934 FALSE
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 244 15
## TRUE 90 21
As can be seen from the table, the model correctly classified 244 + 21 = 265 observations and misclassified 90 + 15 = 105, so the training error is 105/370 ≈ 0.28.
# define the average prediction error (proportion of wrong classifications)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2837838
We can see that the result matches the training error computed from the table.
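The loss function above is simply the misclassification rate at a 0.5 probability threshold; a quick check on hypothetical values:

```r
# same loss function, self-contained
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# hypothetical classes and predicted probabilities:
# the last two predictions fall on the wrong side of 0.5
err <- loss_func(class = c(1, 0, 1, 0), prob = c(0.9, 0.2, 0.1, 0.8))
```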
# the share of high users, i.e. the error of always guessing 'low use'
mean(alc$high_use)
## [1] 0.3
So, simply guessing the majority class (low use) would give about a 30% error, while the model gives 28%. The model is therefore only slightly more accurate than guessing.
The dataset presents Housing Values in Suburbs of Boston. It includes 506 rows and 14 columns and is available in the MASS package.
The following variables are included:
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston) # 506*14
## [1] 506 14
Let’s explore the data and check the distributions.
library(ggplot2)
library(GGally)
library(tidyr)
gather(Boston) %>% ggplot(aes(x = value)) + geom_histogram(aes(y = ..density..), colour = "black", fill = "white", bins = 18) + facet_wrap("key", scales = "free")
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
From the plots we can conclude that:
Let’s look at the correlations between variables
library(MASS)
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
From the plot we can see the following:
Let’s scale the data
# center and standardize variables
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix" "array"
boston_scaled <- as.data.frame(boston_scaled)
We can see the following change after scaling: all variables now have mean zero (and standard deviation one).
Let’s create a categorical variable from the crime rate.
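`scale()` simply subtracts each column mean and divides by each column standard deviation; a toy check:

```r
# hypothetical two-column matrix
m <- cbind(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))

# center and standardize by hand, column-wise
scaled_manual <- sweep(sweep(m, 2, colMeans(m), "-"), 2, apply(m, 2, sd), "/")

# scale() gives the same values
scaled_builtin <- scale(m)
```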
# create a quantile vector of crime and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime_labels <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim, breaks = bins, labels = crime_labels, include.lowest = TRUE)
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crime variable from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
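The same quantile-binning pattern can be seen on a toy vector: `cut()` with `quantile()` breaks produces four roughly equal-sized categories (hypothetical values, not the Boston data):

```r
# hypothetical vector to bin into quartile-based categories
v <- c(1, 2, 3, 4, 5, 6, 7, 8)
brks <- quantile(v)   # min, quartiles, max

cats <- cut(v, breaks = brks,
            labels = c("low", "med_low", "med_high", "high"),
            include.lowest = TRUE)

# each category gets an equal share of the observations
counts <- as.vector(table(cats))
```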
Let’s divide the data into the train and test sets
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
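Note that the split above calls `sample()` without setting a seed, so the train/test sets (and all results that follow) change between runs. A sketch of a reproducible 80/20 split, with a hypothetical number of rows:

```r
set.seed(42)                           # make the random split reproducible
n_obs <- 100                           # hypothetical number of rows
ind <- sample(n_obs, size = n_obs * 0.8)

train_idx <- ind                       # 80% of rows for training
test_idx  <- setdiff(seq_len(n_obs), ind)  # remaining 20% for testing
```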
For linear discriminant analysis we use the categorical crime rate as the target variable and all the other variables as predictor variables.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2351485 0.2574257 0.2500000 0.2574257
##
## Group means:
## zn indus chas nox rm age
## low 0.85120348 -0.8600048 -0.189442747 -0.8419978 0.41980987 -0.8206055
## med_low -0.06609601 -0.2804086 -0.007331936 -0.5705939 -0.09400262 -0.3730817
## med_high -0.40063686 0.2313413 0.117482837 0.4028488 0.05833814 0.4033137
## high -0.48724019 1.0149946 -0.045188669 1.0515659 -0.42488092 0.8250130
## dis rad tax ptratio black lstat
## low 0.8375084 -0.6808519 -0.7303423 -0.34217801 0.38266602 -0.75418687
## med_low 0.3698385 -0.5511961 -0.4839532 -0.07095353 0.31251449 -0.14298211
## med_high -0.3528981 -0.3894442 -0.2759302 -0.21681607 0.03523159 0.05399066
## high -0.8720169 1.6596029 1.5294129 0.80577843 -0.70205728 0.91572443
## medv
## low 0.48766311
## med_low 0.03678848
## med_high 0.09805773
## high -0.72547068
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.122285585 0.75476638 -0.9666674
## indus -0.003968506 -0.22185564 0.3534593
## chas -0.083443893 -0.04914728 0.2123813
## nox 0.303624411 -0.82272814 -1.4071897
## rm -0.056298967 -0.12333866 -0.1265569
## age 0.276367173 -0.23681479 -0.2074570
## dis -0.114041982 -0.30848184 0.1553613
## rad 3.208143392 1.00202505 -0.1373790
## tax -0.015614045 -0.13722477 0.6440533
## ptratio 0.170602030 0.03670422 -0.3326268
## black -0.151605024 0.05826754 0.1246591
## lstat 0.161457513 -0.27206585 0.4796727
## medv 0.127795788 -0.42949555 -0.1434411
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9539 0.0343 0.0118
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
Let’s draw the LDA plot
plot(lda.fit, dimen = 2, col = classes, pch = classes )
lda.arrows(lda.fit, myscale = 2)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 23 8 1 0
## med_low 1 18 3 0
## med_high 0 7 18 0
## high 0 0 1 22
In general the model predicts the low and high crime categories well, but does not perform as well on the middle categories.
To perform the k-means algorithm on our dataset, let’s scale the data again.
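The overall accuracy can be read off the cross tabulation as the share of the diagonal (keeping in mind that the exact numbers vary between runs, because the split is random):

```r
# the confusion matrix from the cross tabulation above
conf <- matrix(c(23,  8,  1,  0,
                  1, 18,  3,  0,
                  0,  7, 18,  0,
                  0,  0,  1, 22),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# accuracy = correctly classified / total
accuracy <- sum(diag(conf)) / sum(conf)   # about 0.79 here
```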
boston_scaled2 <- scale(Boston)
boston_scaled2 <- as.data.frame(boston_scaled2)
summary(boston_scaled2)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Now let’s calculate the distances between the observations using the Euclidean distance.
# euclidean distance matrix
dist_eu2 <- dist(boston_scaled2) # function uses Euclidean by default
summary(dist_eu2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
str(dist_eu2)
## 'dist' num [1:127765] 1.94 2.28 2.57 2.69 2.24 ...
## - attr(*, "Size")= int 506
## - attr(*, "Labels")= chr [1:506] "1" "2" "3" "4" ...
## - attr(*, "Diag")= logi FALSE
## - attr(*, "Upper")= logi FALSE
## - attr(*, "method")= chr "euclidean"
## - attr(*, "call")= language dist(x = boston_scaled2)
Now we run the k-means algorithm on our dataset. First we should determine the optimal number of clusters by looking at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes.
The optimal number of clusters is where the total WCSS drops sharply; by this criterion, two clusters seems optimal.
library(GGally)
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(boston_scaled2, centers = 2)
# plot the result
ggpairs(boston_scaled2[0:7], aes(color = as.factor(km$cluster)))
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
ggpairs(boston_scaled2[8:14], aes(color = as.factor(km$cluster)))
We can see that two clusters are enough to show the difference between the groups. The most evident differences are in indus, nox, tax, rad, dis and ptratio.
The dataset now includes the following variables:
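The quantity plotted in the elbow curve, `tot.withinss`, is the sum over clusters of squared distances to the cluster means. A self-contained check on toy two-cluster data:

```r
set.seed(123)
# hypothetical data with two well-separated groups
m <- cbind(x = c(rnorm(10, 0), rnorm(10, 5)),
           y = c(rnorm(10, 0), rnorm(10, 5)))
km <- kmeans(m, centers = 2)

# total within-cluster sum of squares computed by hand:
# for each cluster, sum of squared deviations from its mean
twcss_manual <- sum(sapply(1:2, function(k) {
  pts <- m[km$cluster == k, , drop = FALSE]
  sum(sweep(pts, 2, colMeans(pts))^2)
}))
# twcss_manual equals km$tot.withinss
```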
Let’s explore the dataset
# check the distributions
library(GGally)   # provides ggpairs()
library(corrplot)
ggpairs(human)
cor(human) %>% round(2)
## gni leb eye mmr abr prp edu.rat lab.rat
## gni 1.00 0.63 0.62 -0.50 -0.56 0.09 0.43 -0.02
## leb 0.63 1.00 0.79 -0.86 -0.73 0.17 0.58 -0.14
## eye 0.62 0.79 1.00 -0.74 -0.70 0.21 0.59 0.05
## mmr -0.50 -0.86 -0.74 1.00 0.76 -0.09 -0.66 0.24
## abr -0.56 -0.73 -0.70 0.76 1.00 -0.07 -0.53 0.12
## prp 0.09 0.17 0.21 -0.09 -0.07 1.00 0.08 0.25
## edu.rat 0.43 0.58 0.59 -0.66 -0.53 0.08 1.00 0.01
## lab.rat -0.02 -0.14 0.05 0.24 0.12 0.25 0.01 1.00
cor(human) %>% corrplot(method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
summary(human)
## gni leb eye mmr
## Min. : 581 Min. :49.00 Min. : 5.40 Min. : 1.0
## 1st Qu.: 4198 1st Qu.:66.30 1st Qu.:11.25 1st Qu.: 11.5
## Median : 12040 Median :74.20 Median :13.50 Median : 49.0
## Mean : 17628 Mean :71.65 Mean :13.18 Mean : 149.1
## 3rd Qu.: 24512 3rd Qu.:77.25 3rd Qu.:15.20 3rd Qu.: 190.0
## Max. :123124 Max. :83.50 Max. :20.20 Max. :1100.0
## abr prp edu.rat lab.rat
## Min. : 0.60 Min. : 0.00 Min. :0.1717 Min. :0.1857
## 1st Qu.: 12.65 1st Qu.:12.40 1st Qu.:0.7264 1st Qu.:0.5984
## Median : 33.60 Median :19.30 Median :0.9375 Median :0.7535
## Mean : 47.16 Mean :20.91 Mean :0.8529 Mean :0.7074
## 3rd Qu.: 71.95 3rd Qu.:27.95 3rd Qu.:0.9968 3rd Qu.:0.8535
## Max. :204.80 Max. :57.50 Max. :1.4967 Max. :1.0380
From the plot we can see the following:
Only the expected years of education (eye) variable is close to a normal distribution; the other variables are skewed.
Let’s perform principal component analysis for our dataset
# performing principal component analysis (with the SVD method)
pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## gni -9.999832e-01 -0.0057723054 5.156742e-04 -4.932889e-05 -0.0001135863
## leb -2.815823e-04 0.0283150248 -1.294971e-02 6.752684e-02 0.9865644425
## eye -9.562910e-05 0.0075529759 -1.427664e-02 3.313505e-02 0.1431180282
## mmr 5.655734e-03 -0.9916320120 -1.260302e-01 6.100534e-03 0.0266373214
## abr 1.233961e-03 -0.1255502723 9.918113e-01 -5.301595e-03 0.0188618600
## prp -5.526460e-05 0.0032317269 7.398331e-03 9.971232e-01 -0.0716401914
## edu.rat -5.607472e-06 0.0006713951 3.412027e-05 2.736326e-04 -0.0022935252
## lab.rat 2.331945e-07 -0.0002819357 -5.302884e-04 4.692578e-03 0.0022190154
## PC6 PC7 PC8
## gni 2.711698e-05 8.075191e-07 -1.176762e-06
## leb 1.453515e-01 -5.380452e-03 2.281723e-03
## eye -9.882477e-01 3.826887e-02 7.776451e-03
## mmr -1.695203e-03 -1.355518e-04 8.371934e-04
## abr -1.273198e-02 8.641234e-05 -1.707885e-04
## prp 2.309896e-02 2.642548e-03 2.680113e-03
## edu.rat -2.180183e-02 -6.998623e-01 7.139410e-01
## lab.rat -3.264423e-02 -7.132267e-01 -7.001533e-01
# drawing a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.6, 0.6), col = c("brown", "blue"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
The result cannot be meaningfully interpreted: the unscaled GNI variable dominates the first principal component because of its large variance, so we need to scale the data.
# scaling the variables
human_scaled <- scale(human)
summary(human_scaled)
## gni leb eye mmr
## Min. :-0.9193 Min. :-2.7188 Min. :-2.7378 Min. :-0.6992
## 1st Qu.:-0.7243 1st Qu.:-0.6425 1st Qu.:-0.6782 1st Qu.:-0.6496
## Median :-0.3013 Median : 0.3056 Median : 0.1140 Median :-0.4726
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.6717 3rd Qu.: 0.7126 3rd Qu.: 0.1932
## Max. : 5.6890 Max. : 1.4218 Max. : 2.4730 Max. : 4.4899
## abr prp edu.rat lab.rat
## Min. :-1.1325 Min. :-1.8203 Min. :-2.8189 Min. :-2.6247
## 1st Qu.:-0.8394 1st Qu.:-0.7409 1st Qu.:-0.5233 1st Qu.:-0.5484
## Median :-0.3298 Median :-0.1403 Median : 0.3503 Median : 0.2316
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6030 3rd Qu.: 0.6127 3rd Qu.: 0.5958 3rd Qu.: 0.7350
## Max. : 3.8344 Max. : 3.1850 Max. : 2.6646 Max. : 1.6632
We can see that all the means equal zero.
Then we perform PCA again but on standardized data
# perform PCA (SVD method)
human_pca <- prcomp(human_scaled)
human_pca
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5 PC6
## gni -0.35048295 -0.05060876 0.20168779 -0.72727675 0.4950306 -0.11120305
## leb -0.44372240 0.02530473 -0.10991305 -0.05834819 -0.1628935 0.42242796
## eye -0.42766720 -0.13940571 0.07340270 -0.07020294 -0.1659678 0.38606919
## mmr 0.43697098 -0.14508727 0.12522539 -0.25170614 0.1800657 -0.17370039
## abr 0.41126010 -0.07708468 -0.01968243 0.04986763 0.4672068 0.76056557
## prp -0.08438558 -0.65136866 -0.72506309 0.01396293 0.1523699 -0.13749772
## edu.rat -0.35664370 -0.03796058 0.24223089 0.62678110 0.5983585 -0.17713316
## lab.rat 0.05457785 -0.72432726 0.58428770 0.06199424 -0.2625067 0.03500707
## PC7 PC8
## gni -0.13711838 -0.16961173
## leb -0.43406432 0.62737008
## eye 0.77962966 -0.05415984
## mmr 0.35380306 0.72193946
## abr -0.06897064 -0.14335186
## prp 0.00568387 -0.02306476
## edu.rat 0.05773644 0.16459453
## lab.rat -0.22729927 -0.07304568
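Because the data were standardized, each of the 8 variables contributes a variance of 1, so the standard deviations reported above translate directly into proportions of variance explained. A small sketch using the printed values:

```r
# standard deviations of the principal components, copied from the prcomp output
sdev <- c(2.0708380, 1.1397204, 0.8750485, 0.7788630,
          0.6619563, 0.5363061, 0.4589994, 0.3222406)

# proportion of total variance explained by each component (in %);
# for standardized data, sum(sdev^2) equals the number of variables (8)
pve <- round(100 * sdev^2 / sum(sdev^2), 1)
pve  # PC1 ~ 53.6%, PC2 ~ 16.2%
```

These are the percentages often printed on the axes of a PCA biplot.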
# draw the biplot of the PCA result (observations in brown, variable arrows in blue)
biplot(human_pca, choices = 1:2, cex = c(0.4, 0.6), col = c("brown", "blue"))
Now the plot is much easier to interpret: after standardization every arrow has a comparable length, so no single variable dominates the picture.
Personal interpretation of the first two principal component dimensions, based on the rotation matrix above:
The first dimension contrasts overall well-being with reproductive-health problems: leb, eye, gni and edu.rat load negatively, while mmr and abr load positively, so countries with high maternal mortality and adolescent birth rates sit at the opposite end from countries with long life expectancy, high income and good access to education.
The second dimension is driven almost entirely by prp and lab.rat, i.e. women’s representation in parliament and participation in the labour force.
Next we load the tea dataset that ships with the FactoMineR package:
library(FactoMineR)
data(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# new dataset with selected columns
tea_time <- dplyr::select(tea, one_of(keep_columns))
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
dim(tea_time)
## [1] 300 6
Then we visualize the dataset:
library(ggplot2)
library(tidyr)
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
Now we perform the MCA:
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
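The percentages in the eigenvalue table can be reproduced by hand. In MCA the total inertia depends only on the numbers of variables and categories: with J = 6 variables and K = 17 categories in tea_time, it is (K − J)/J = 11/6 ≈ 1.833. A sketch using the eigenvalues printed above:

```r
# eigenvalues copied from the MCA summary above (rounded to 3 decimals)
ev <- c(0.279, 0.261, 0.219, 0.189, 0.177, 0.156,
        0.144, 0.141, 0.117, 0.087, 0.062)

# total inertia for J = 6 variables with K = 17 categories in total
total_inertia <- (17 - 6) / 6   # ~1.833, matches sum(ev)

# percentage of variance per dimension, as in the summary table
round(100 * ev / sum(ev), 1)    # ~15.2, ~14.2, ... (tiny differences are rounding)
```

In FactoMineR the same table is available directly as `mca$eig`.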
The eta² table shows that “how” (0.71) and “where” (0.70) have the strongest association with the first dimension; the same two variables are also the most strongly associated with the second dimension (0.52 and 0.68). Note that these values are squared correlation ratios between each categorical variable and the dimension, not plain correlations.
Let’s visualize MCA by categories and individuals:
# plot the variable categories (individuals hidden)
plot(mca, invisible = c("ind"), habillage = "quali")
# plot the individuals (variable categories hidden)
plot(mca, invisible = c("var"), habillage = "quali")
We can see that the categories of “how” (packaging) and “where” (place of purchase) lie close to each other on the map, so these two variables are strongly associated. The first dimension explains about 15.2% of the variance and the second about 14.2%, so this two-dimensional map captures roughly 29.5% of the total variation.